Markov Chain Monte Carlo


Lecture 14

March 18, 2024

Last Class

Bayesian Computation

  • Goal: Sample from posterior distribution to:
    • Capture parametric uncertainty;
    • Compute MAP estimates.
  • Challenging because of the arbitrary nature of the distributions involved.

Markov Chain Strategy

  • Generate an appropriate Markov chain so that its stationary distribution is the target distribution \(\pi\);
  • Run its dynamics long enough to converge to the stationary distribution;
  • Use the resulting ensemble of states as Monte Carlo samples from \(\pi\) .

Markov Chain Convergence

Given a Markov chain \(\{X_t\}_{t=1, \ldots, T}\) returned by this procedure, targeting the distribution \(\pi\):

  • \(\mathbb{P}(X_t = y) \to \pi(y)\) as \(t \to \infty\)
  • This means the chain can be considered a dependent sample approximately distributed from \(\pi\).
  • The first values (the transient portion) of the chain are highly dependent on the initial value.

Metropolis-Hastings Algorithm

The Metropolis-Hastings Algorithm

The Metropolis-Hastings algorithm:

  • The foundational MCMC algorithm (named one of the top ten algorithms of the 20th century).
  • Builds a Markov chain based on transitions by:
    • generating proposals for new samples from a conditional proposal distribution \(q(y | x)\);
    • accepting or rejecting those proposals.

The Metropolis-Hastings Algorithm

Given \(X_t = x_t\):

  1. Generate \(Y_t \sim q(y | x_t)\);
  2. Set \(X_{t+1} = Y_t\) with probability \(\rho(x_t, Y_t)\), where \[\rho(x, y) = \min \left\{\frac{\pi(y)}{\pi(x)}\frac{q(x | y)}{q(y | x)}, 1\right\},\] else set \(X_{t+1} = x_t\).
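The two steps above can be sketched directly with the general (possibly asymmetric) acceptance ratio. Here `logπ`, `sample_q`, and `logq` are assumed helper functions (log target density, proposal sampler, and log proposal density), not part of the lecture code:

```julia
# One Metropolis-Hastings transition (a sketch, not the lecture's code).
# logπ(x): log target density; sample_q(x): draws Y ~ q(y | x);
# logq(y, x): evaluates log q(y | x).
function mh_step(x, logπ, sample_q, logq)
    y = sample_q(x)  # step 1: generate a proposal
    # step 2: log of π(y)q(x|y) / (π(x)q(y|x)); capping at 0 gives min{·, 1}
    logρ = logπ(y) - logπ(x) + logq(x, y) - logq(y, x)
    return log(rand()) < min(logρ, 0.0) ? y : x  # accept, or keep x
end
```

Working on the log scale avoids numerical underflow when the densities involved are very small.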

How Simple Is That?

The devil is in the details: performance and efficiency are highly dependent on the choice of \(q\).

Key: There is a tradeoff between exploration and acceptance.

  • Wide proposal: Can make bigger jumps, may be more likely to reject proposals.
  • Narrow proposal: More likely to accept proposals, may not “mix” efficiently.

Proposal Distribution Choice

The original Metropolis et al. (1953) algorithm used symmetric distributions (\(q(y | x) = q(x | y)\)).

Then the acceptance probability reduces to \[\rho = \min \left\{\frac{\pi(y)}{\pi(x)}, 1\right\}.\]

A common choice: \(y \sim \text{Normal}(X_t, \sigma)\) centered around the current point \(X_t\).

Julia Implementation

function mh_transition(x_current, σ)
    # generate new proposal centered at the current value
    x_proposal = rand(Normal(x_current, σ))
    u = rand()
    ρ = log(target_density(x_proposal)) - log(target_density(x_current)) # log acceptance ratio
    if log(u) < min(ρ, 0) # min(·, 1) becomes min(·, 0) on the log scale
        y = x_proposal
    else
        y = x_current
    end
    return y, log(target_density(y))
end

Julia Implementation

function mh_algorithm(n_iter, σ, x₀)
    # initialize storage
    samples = zeros(n_iter) 
    log_target = zeros(n_iter)
    samples[1] = x₀ # start algorithm
    log_target[1] = log(target_density(x₀))
    accept_count = 0
    for i = 2:length(samples) # iterate
        samples[i], log_target[i] = mh_transition(samples[i-1], σ)
        if samples[i] != samples[i-1]
            accept_count += 1
        end
    end
    accept_rate = accept_count / (n_iter - 1) # rate over the n_iter - 1 transitions
    return samples, log_target, accept_rate
end

Linear Regression Example

\[ \begin{gather} y = 5 + 2x + \varepsilon \\ \varepsilon \sim \text{Normal}(0, 3) \end{gather} \]

Code
# create trend for data
x = rand(Uniform(0, 20), 20)
y = 5 .+ 2 * x
# sample and add noise
ε = rand(Normal(0, 3), 20)
y .+= ε

p = scatter(x, y, label="Data", xlabel=L"$x$", ylabel=L"$y$", markershape=:o, markersize=10, tickfontsize=16, guidefontsize=18, legendfontsize=16, bottom_margin=10mm, left_margin=5mm, right_margin=5mm) 
plot!(p, size=(600, 550))
Figure 1: Regression data plot

Model Specification

\[ \begin{gather} y = a + bx + \varepsilon \\ \varepsilon \sim \text{Normal}(0, \sigma). \end{gather} \]

This makes the likelihood: \[ y \sim \text{Normal}(a+bx, \sigma). \]

Prior Selection

\[ \begin{gather} a \sim \text{Normal}(0, 2) \\ b \sim \text{Normal}(0, 2) \\ \sigma \sim \text{Half-Normal}(0, 1) \end{gather} \]
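Combining these priors with the likelihood gives an unnormalized log-posterior. The slides' `target_density` is not shown; a hypothetical sketch of what it could look like for \(\theta = (a, b, \sigma)\):

```julia
using Distributions

# Sketch of the unnormalized log-posterior for θ = (a, b, σ) under the
# priors above; `x` and `y` are the observed data vectors.
function log_posterior(θ, x, y)
    a, b, σ = θ
    σ > 0 || return -Inf  # σ outside its support has zero density
    log_prior = logpdf(Normal(0, 2), a) + logpdf(Normal(0, 2), b) +
                logpdf(truncated(Normal(0, 1), lower=0), σ)
    log_lik = sum(logpdf.(Normal.(a .+ b .* x, σ), y))  # independent Normal errors
    return log_prior + log_lik
end
```

Returning \(-\infty\) outside the support of \(\sigma\) makes any such proposal reject automatically in the M-H acceptance step.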

Code
# generate data for samples
function gen_data(a, b, σ)
    x = collect(0:20)
    y = a .+ b * x
    # sample and add noise
    ε = rand(Normal(0, σ), length(x))
    y .+= ε
    return y
end

# sample and plot
n_samples = 1000
a = rand(Normal(0, 2), n_samples)
b = rand(Normal(0, 2), n_samples)
σ = rand(truncated(Normal(0, 1), lower=0), n_samples)
y_prior = [gen_data(a[i], b[i], σ[i]) for i in 1:n_samples]
# convert y to a Matrix by vcatting each vector
y_prior = mapreduce(permutedims, vcat, y_prior) 
plt_prior_1 = plot(; ylabel=L"$y$", xlabel=L"$x$",
    tickfontsize=16, legendfontsize=16, guidefontsize=18, bottom_margin=5mm, left_margin=5mm, legend=false)
for x in [0, 5, 10, 15, 20]
    boxplot!(plt_prior_1, [x], y_prior[:, x+1], color=:blue)
end
plot!(plt_prior_1, size=(600, 550))
plt_prior_1
Figure 2: Prior predictive plot for regression example.

Proposal Distribution and Initial Value

To illustrate how the M-H algorithm works, let’s use a proposal \[\mathcal{N}(x_t, 0.01I_3).\]

And let’s start at \[x_0 = \begin{pmatrix}1 \\ 1 \\ 1\end{pmatrix}\].

First Proposal

Current:

\[X_0 = \begin{pmatrix}1 \\ 1 \\ 1\end{pmatrix}\]

\[\text{log-posterior} = -2851\]

Iteration:

\[y = \begin{pmatrix}0.94 \\ 1.07 \\ 0.82\end{pmatrix}\]

\[\text{log-posterior} = -3433\]

\[\rho \approx 0 \Rightarrow X_1 = X_0\]

Another Proposal

Current:

\[X_1 = \begin{pmatrix}1 \\ 1 \\ 1\end{pmatrix}\]

\[\text{log-posterior} = -2851\]

Iteration:

\[y = \begin{pmatrix}1.24 \\ 1.05 \\ 1.04\end{pmatrix}\]

\[\text{log-posterior} = -2165\]

\[\rho =1 \Rightarrow X_2 = y\]

1,000 Iterations

Code
samples, lpost, α = mh_algorithm(100000, [1; 1; 1], x, y)

p = plot(samples[1:1000, :], layout=(1, 3), label="Samples", guidefontsize=18, tickfontsize=16, legendfontsize=16, bottom_margin=15mm, ylabel="Value", xlabel="Iteration", left_margin=10mm)
hline!(p, [5 2 3], color=:red, linestyle=:dash, label="True Value")
plot!(p, size=(1300, 400))
Figure 3: First 1,000 iterations of the MCMC example

100,000 Iterations

Code
p = plot(samples, layout=(1, 3), label="Samples", guidefontsize=18, tickfontsize=16, legendfontsize=16, bottom_margin=15mm, xlabel="Iteration", left_margin=10mm, xticks=0:50000:100000, right_margin=15mm, ylabel = [L"$a$" L"$b$" L"$\sigma$"], legend=[:false :bottomright :false])
hline!(p, [5 2 3], color=:red, linestyle=:dash, label="True Value", linewidth=3)
plot!(p, size=(1300, 400))
Figure 4: 100,000 iterations of the MCMC example

Acceptance rate: 0.50435

Marginal Distributions

Code
p = histogram(samples[10001:end, :], layout=(1, 3), label=false, guidefontsize=18, tickfontsize=16, legendfontsize=14, bottom_margin=15mm, ylabel="Count", left_margin=10mm, xlabel = [L"$a$" L"$b$" L"$\sigma$"], legend=[:false :false :topright], color=:lightblue)
vline!(p, [5 2 3], color=:red, linestyle=:dash, label="True Value", linewidth=3)
vline!(p, mapslices(mean, samples[10001:end, :], dims=1), color=:purple, linestyle=:dash, label="Posterior Mean", linewidth=3)
plot!(p, size=(1300, 450))
Figure 5: 100,000 iterations of the MCMC example

Considerations for Implementation

Proposal Distribution Example

Code
# target density: modified Normal(0, 1) PDF
function target_distribution(x) 
    return sin(x)^2 * sin(2x)^2 * pdf(Normal(0, 1), x)
end

# compute the normalizing constant by numerical integration
marg_dens, err = quadgk(x -> target_distribution(x), -Inf, Inf)
# plot target density
x = -π:0.01:π
p_base = plot(x, target_distribution.(x) ./ marg_dens, linewidth=3, label="Target Density", tickfontsize=16, legendfontsize=16, guidefontsize=18, bottom_margin=5mm, left_margin=5mm)
plot!(xlabel=L"x", ylabel="Density")
# pick current value
x_current = 0.5
vline!([x_current], color=:black, linewidth=2, label=L"$x_t$")
Figure 6: Example target density for Metropolis-Hastings

Proposal Distribution Examples

Code
# plot proposal distributions
p1 = plot!(deepcopy(p_base), x, pdf.(Normal(x_current, 0.1), x) ./ 10, color=:purple, label="Scaled Narrow Proposal", linewidth=2, linestyle=:dash)
p2 = plot!(deepcopy(p_base), x, pdf.(Normal(x_current, 0.5), x) ./ 2, color=:red, label="Scaled Wide Proposal", linewidth=2, linestyle=:dash)
plot!(p1, size=(600, 500))
plot!(p2, size=(600, 500))
display(p1)
display(p2)
Figure 7: Example target density with (a) a scaled narrow proposal and (b) a scaled wide proposal.

Sampling Efficiency

Two common measures of sampling efficiency:

  • Acceptance Rate: Rate at which proposals are accepted
    • “Optimally” 30-45% (depending on number of parameters)
  • Effective Sample Size (ESS): Accounts for autocorrelation \(\rho_t\) across samples \[N_\text{eff} = \frac{N}{1+2\sum_{t=1}^\infty \rho_t}\]
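The ESS formula can be estimated from the chain's empirical autocorrelations. A rough sketch (truncating the sum at the first non-positive autocorrelation; packages such as MCMCChains.jl use more careful estimators):

```julia
using Statistics

# Rough effective sample size: estimate lag-t autocorrelations and
# truncate the sum at the first non-positive value (a simple heuristic).
function ess(chain)
    N = length(chain)
    c = chain .- mean(chain)
    v = sum(abs2, c) / N  # lag-0 autocovariance
    s = 0.0
    for t in 1:(N - 1)
        ρt = sum(c[1:N-t] .* c[1+t:N]) / (N * v)  # lag-t autocorrelation
        ρt <= 0 && break  # stop once correlations die out
        s += ρt
    end
    return N / (1 + 2s)
end
```

For an independent sample the estimate is close to \(N\); for a strongly autocorrelated chain it can be orders of magnitude smaller.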

Sampling Efficiency Example

Figure: MCMC sampling for various proposals.

Autocorrelation of Chains

Figure: Autocorrelation of the chains for various proposals.

ESS by Proposal Variance for Example

Figure: Effective sample size by proposal variance.

Assessing Convergence

Transient Chain Portion

What do we do with the transient portion of the chain?

  • Discard as burn-in;
  • Just run the chain longer.

How To Identify Convergence?

Short answer: There is no guarantee! Convergence is a judgement call based on an accumulation of evidence from various heuristics.

  • The good news: identifying the precise end of the transient portion doesn't matter much.
  • If a few transient iterations remain, their effect will be washed out by a large enough post-convergence chain.

Heuristics for Convergence

Compare the distribution (histogram/kernel density plot) from half of the chain to that from the full chain.

Figure 8: Half-chain vs. full-chain distribution comparison after 2,000 and 10,000 iterations.

Gelman-Rubin Diagnostic

Gelman & Rubin (1992)

  • Run multiple chains from “overdispersed” starting points
  • Compare intra-chain and inter-chain variances
  • Summarized as \(\hat{R}\) statistic: closer to 1 implies better convergence.
  • Can also check distributions across multiple chains vs. the half-chain check.
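A minimal sketch of the original (non-split) \(\hat{R}\) computation; modern variants split each chain in half and rank-normalize first:

```julia
using Statistics

# Basic Gelman-Rubin R̂ for m chains of length n, stored as columns.
# This is the original (non-split) version, for illustration only.
function gelman_rubin(chains::AbstractMatrix)
    n, m = size(chains)
    chain_means = vec(mean(chains, dims=1))
    W = mean(var(chains, dims=1))       # mean within-chain variance
    B = n * var(chain_means)            # between-chain variance
    vhat = (n - 1) / n * W + B / n      # pooled variance estimate
    return sqrt(vhat / W)               # R̂ → 1 as the chains agree
end
```

Chains started from overdispersed points that have not yet mixed inflate \(B\) relative to \(W\), pushing \(\hat{R}\) well above 1.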

On Multiple Chains

Unless a specific scheme is used, multiple chains are not a solution for convergence issues: each individual chain still needs to converge on its own, with its burn-in discarded.

This means multiple chains are most useful for diagnostics; once all of them have been run long enough, their samples can be mixed freely.

Heuristics for Convergence

  • If you're mainly interested in the mean estimate, you can also look at its stability by iteration or at the Monte Carlo standard error.
  • Look at traceplots; do you see sudden “jumps”?
  • When in doubt, run the chain longer.
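Two of these checks can be sketched in a few lines: a running mean by iteration and a batch-means estimate of the Monte Carlo standard error (both hypothetical helpers, not from the slides):

```julia
using Statistics

# Running mean by iteration: if this still drifts late in the chain,
# the chain probably needs to run longer.
running_mean(chain) = cumsum(chain) ./ (1:length(chain))

# Batch-means Monte Carlo standard error of the posterior mean:
# split the chain into batches and use the spread of the batch means.
function mcse_batch(chain; nbatch=20)
    len = length(chain) ÷ nbatch
    batch_means = [mean(chain[(i-1)*len+1:i*len]) for i in 1:nbatch]
    return std(batch_means) / sqrt(nbatch)
end
```

Batch means absorb the chain's autocorrelation, so this gives a more honest standard error than treating the samples as independent.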

Increasing Efficiency

Adaptive Metropolis-Hastings

Adjust proposal density to hit target acceptance rate.

  • Need to be cautious about detailed balance.
  • Typical strategy is to adapt for a portion of the initial chain (part of the burn-in), then run longer with that proposal.
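A hypothetical sketch of this strategy: during burn-in, run short blocks, compare the empirical acceptance rate to a target, and rescale the proposal standard deviation accordingly. The `step` argument can wrap the `mh_transition` function from earlier:

```julia
# Sketch of proposal adaptation during burn-in. `step` is any transition
# function (x, σ) -> new x; after adaptation, freeze σ for the main run.
function adapt_sigma(step, x₀, σ₀; blocks=50, block_len=100, target=0.4)
    x, σ = x₀, σ₀
    for _ in 1:blocks
        accepts = 0
        for _ in 1:block_len
            x_new = step(x, σ)
            accepts += x_new != x  # rejected proposals leave x unchanged
            x = x_new
        end
        σ *= exp(accepts / block_len - target)  # nudge σ toward target rate
    end
    return σ, x
end
```

Adapting only during burn-in and then freezing σ sidesteps the detailed-balance concerns that continuous adaptation raises.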

Hamiltonian Monte Carlo

  • Idea: Use proposals which steer towards “typical set” without collapsing towards the mode (based on Hamiltonian vector field);
  • Requires gradient information: can be obtained through autodifferentiation; challenging for external models;
  • Can be very efficient due to potential for anti-correlated samples, but very sensitive to parameterization.
  • Will see how to use this next class.

Key Points, Upcoming Schedule, and References

Key Points (Metropolis-Hastings)

  • Construct ergodic and reversible Markov chains with posterior as stationary distribution.
  • Metropolis-Hastings: conceptually simple algorithm, but implementation plays a major role.
  • Proposal distribution plays a large role in acceptance rate and effective sample size.

Key Points (Convergence)

  • Must rely on “accumulation of evidence” from heuristics for determination about convergence to stationary distribution.
  • Transient portion of chain: Meh. Some people worry about this too much. Discard or run the chain longer.
  • Parallelizing solves few problems, but running multiple chains can be useful for diagnostics.

Next Classes

Wednesday: MCMC Examples

Monday: MCMC Lab (No exercises these weeks)

Next Wednesday: Literature Presentations (email slides by 9pm Tuesday night).

Assessments

  • Homework 3: Due 3/22

References

Gelman, A., & Rubin, D. B. (1992). Inference from Iterative Simulation Using Multiple Sequences. Stat. Sci., 7, 457–472. https://doi.org/10.1214/ss/1177011136
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of State Calculations by Fast Computing Machines. J. Chem. Phys., 21, 1087–1092. https://doi.org/10.1063/1.1699114